Data is an important asset of any organization and forms part of its intellectual
property. Every organization holds sensitive data such as customer information,
financial data, patient records, personal credit card data, and other information,
depending on the kind of management, institute, or industry. In such areas,
information leakage is a crucial problem that the organization has to face, and a
leak carries a high cost. More precisely, information leakage is defined as the
intentional exposure of individual or any other sort of information to unauthorized
outsiders. When important information falls into unauthorized hands or moves
towards an unauthorized destination, it causes direct and indirect losses to the
affected organization in terms of cost and time. Information leakage results in the
vulnerability or modification of the information, so information must be protected
from leakage to outsiders. To solve this issue, there must be an efficient and
effective system to prevent leakage and protect authorized information. For some
time, many methods have been implemented to solve this type of problem, and they
are analyzed in this survey. This paper analyzes several recent techniques and
proposes a novel sampling-algorithm-based data leakage detection technique.


Intrusion detection system 
Sampling algorithm 
1. INTRODUCTION

Data leakage is nothing but an unauthorized user gaining access to the important data of a person or an
organization. The important data of an organization can include customer information, business plans,
financial condition, patient information, employees' credit card data, and so on, depending on the business,
person, or industry. In many cases, however, the owner must share its data, for example with employees who
work from home on their own devices, or with customers.

This raises the possibility of important data falling into the wrong hands. A leak may result from an
accident or a mistake by a person working in the organization, or from outsiders such as hackers, and it can
cause major damage to the organization. Present systems for data leak detection depend on set intersection.
Set intersection is performed on two sets of n-grams, one from the content and one from the sensitive data.
The set intersection gives the number of sensitive n-grams appearing in the content.
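
The set-intersection approach described above can be illustrated with a minimal sketch. It assumes character 3-grams and uses hypothetical helper names (`ngrams`, `sensitive_ngram_count`) and made-up sample strings; real detectors typically work on fingerprints of n-grams rather than on raw text.

```python
def ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of `text` (n = 3 is an assumption)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def sensitive_ngram_count(content: str, sensitive: str, n: int = 3) -> int:
    """Count the distinct sensitive n-grams that also appear in the content."""
    return len(ngrams(content, n) & ngrams(sensitive, n))


# Hypothetical sample data, for illustration only.
sensitive = "4111-1111-2222"
leaked = "card no: 4111-1111-2222 exp 09/27"
clean = "meeting moved to room 12"

print(sensitive_ngram_count(leaked, sensitive))  # every sensitive 3-gram is found
print(sensitive_ngram_count(clean, sensitive))   # no sensitive 3-gram is found
```

Because a set discards position, the count does not depend on the order in which the n-grams occur in the content.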

In this way, data leakage becomes a huge issue for individual users, industries, and institutes. It
raises serious questions about the honesty of the users of those organizations and systems. It is tough to
find out which of the various users leaked the data, and this can also create ethical problems in the working
organization. The potential harm and adverse results of an information leak can be divided into two classes:
direct and indirect loss. Direct loss refers to tangible harm that is not hard to measure and assess
quantitatively. Indirect loss is much harder to measure and has a substantially more extensive effect in
terms of cost and time. Direct loss includes the consequences of violating regulations, loss of future sales,
costs of investigation, and remedial or restoration fees. Indirect loss can include a lower share price as a
consequence of negative publicity, damage to the reputation of the organization, and exposure of
intellectual property to competitors.

The novel solution searches for transformed leaks in information using a sequence alignment algorithm,
which is executed on the sampled sensitive data sequence and the sampled content. The alignment scores
indicate the amount of sensitive information contained in the content. Our alignment-based solution takes
the order of n-grams into account. It also handles arbitrary variations of patterns without an explicit
specification of all possible variation patterns.


2. LITERATURE REVIEW 

In paper [1], the authors use a sequence alignment method for detecting complex data-leak patterns.
The algorithm is designed for recognizing long and sensitive data patterns. This detection is paired with a
sampling algorithm that allows one to compare the similarity of two independently sampled sequences. The
system achieves good detection accuracy in recognizing transformed leaks.

In paper [2], the authors implemented two algorithms for detecting long and transformed information
leaks. Their system achieves high detection accuracy in recognizing transformed leaks compared with
state-of-the-art inspection methods. They parallelize their design on a graphics processing unit and
demonstrate the strong scalability of their detection solution, as required by a sizable organization. In
paper [3], the authors design fuzzy fingerprint, a privacy-preserving data-leak detection system, and also
provide its realization. By using special digests, the exposure of the sensitive data is kept to a minimum
during detection. The authors conducted experiments to confirm the accuracy, privacy, and efficiency of their
solution.

In paper [4], the authors developed the Aquifer security system, which assigns host export restrictions
on all data accessed as part of a user interface (UI) workflow. The key insight is that when applications in
modern operating systems share data, it is part of a larger workflow to perform a user task. Each
application on the UI workflow is a potential data owner and can therefore contribute to the security
restrictions. The restrictions are retained with the data as it is written to storage and propagated to
future UI workflows that read it. In doing so, Aquifer enables applications to sensibly retain control of
their data after it has been shared as part of the user's tasks.

In paper [5], the authors present Attire, an app for computers and smartphones that presents the user
with an avatar. Attire conveys real-time data exposure in a lightweight and unobtrusive manner via updates
to the avatar's clothing. In paper [6], the authors present the Data-Driven Semi-Global Alignment (DDSGA)
method. From the point of view of security effectiveness, DDSGA improves the scoring system by adopting
distinct alignment parameters for every user. It also tolerates small mutations in user command sequences
by allowing small changes in the low-level representation of the commands' functionality. It additionally
adapts to changes in user behavior by updating the signature of a user according to its current behavior.
To reduce the runtime overhead, DDSGA minimizes the alignment overhead and parallelizes the detection and
the update.

In paper [7], the authors propose a novel method for capturing richer semantics of the user's intent.
The technique is based on the observation that, for most text-based applications, the user's intent is
displayed entirely on screen as text, and the user will make modifications if what is on screen is not what
he or she wants. Based on this idea, the authors developed a prototype called Gyrus, which enforces correct
behavior of applications by capturing user intent. Using Gyrus, they demonstrate how to stop malicious
activities that manipulate the host system to send malicious traffic, such as social network impersonation
attacks and online financial services fraud. The evaluation results show that Gyrus successfully stops
modern malware, and the study indicates that it would be very tough for future attacks to defeat it.
Finally, the performance analysis shows that Gyrus is a viable option for deployment on standalone PCs with
continuous user interaction. Gyrus fills an important gap by enabling security policies that take user
intent into account when determining the legitimacy of network traffic.

In paper [8], the authors implemented a domain-specific concurrency model that supports a large class
of IDS analyses without depending on a particular detection technique. The technique divides the stream of
network events into subsets that the IDS can process independently, while making sure each subset contains
every event relevant to a detection case. The partitioning method is based on the concept of detection
scope, i.e., the smallest “slice” of traffic that a detector must examine to perform its function. As this
concept has general applicability, the model supports both simple per-flow detection techniques and more
complex, high-level detectors.

According to the authors of [9], detecting the exposure of sensitive data is not straightforward
because of data transformation in the content. Transformations (for example, insertion and deletion) result
in highly unpredictable leak patterns. Existing automata-based string matching algorithms are impractical
for detecting transformed data leaks because of the formidable complexity of modeling the required regular
expressions. The authors create two novel algorithms for detecting long and inexact data leaks. Their system
achieves high detection accuracy in recognizing transformed leaks compared with state-of-the-art inspection
techniques. They parallelize their design on a graphics processing unit and demonstrate the strong
scalability of the data leak detection solution when analyzing big data.

In paper [10], the authors observe that many of the distance metrics used for computing behavioral
similarity between network hosts fail to capture the semantic meaning imbued by network protocols.
Moreover, these metrics tend to neglect the long-term temporal structure of the objects being measured. To
examine the role of these semantic and temporal attributes, the authors develop a new behavioral distance
metric for network hosts and compare its performance with a metric that ignores such information.
Specifically, they propose semantically meaningful metrics for common data types found in network data,
show how these metrics can be combined to treat network data as a unified metric space, and describe a
temporal sequencing algorithm that captures long-term causal relationships.

Shoulin Yin et al. [11] introduced a novel concept, searchable asymmetric encryption, which is useful
for security and for search operations on encrypted data. It greatly enhances information protection and
prevents the leakage of the user's search criteria (the search pattern). In paper [12], the authors describe
the KAMAN (Kerberos assistant mobile ad-hoc network) protocol, which prevents leaks of user information
caused by virtual side channel attacks in a cloud environment. Moez Altayeb et al. [13] described the
concepts of radiation leaks and data leaks in a wireless sensor network. Their approach locates the leaking
station and controls the station's power consumption by sending a special command to it from the server node.

Table 1 summarizes the reviewed papers, with the method used, advantages, and disadvantages of each.


Table 1. Various Authors Papers Details with Method Used, Advantages and Disadvantages 


| Sr. No. | Title | Paper Details | Method Used | Advantages | Disadvantages |
|---|---|---|---|---|---|
| 1 | Fast Detection of Transformed Data Leaks | Utilizes sequence alignment techniques for detecting complex data-leak patterns. | Comparable sampling algorithm and sampling-oblivious alignment algorithm. | The prototype provides substantial speedup and indicates high scalability of the design. | A data-movement tracking approach is not used. |
| 2 | Rapid screening of transformed data leaks with efficient algorithms and parallel computing | Designs two novel algorithms for detecting long as well as transformed information leaks. | Sequence alignment algorithm | The technique has a high level of precision in finding transformed information leaks compared with the state-of-the-art set intersection technique. | Time-consuming process. |
| 3 | Privacy-Preserving Detection of Sensitive Data Exposure | Provides a privacy-preserving data-leak detection (DLD) solution in which a special set of sensitive information digests is used in detection. | MapReduce algorithm | Capability to scale arbitrarily and to use public resources for the process. | The system is not developed for intentional information exfiltration, which typically uses strong encryption. |
| 4 | Preventing accidental data disclosure in modern operating systems (2013) | Develops Aquifer as a policy system and framework for avoiding accidental data disclosure in modern operating systems. | Aquifer as a policy system and framework for avoiding accidental data disclosure in modern operating systems | In Aquifer, application developers give secrecy restrictions that protect the entire user interface workflow during the user task. | Malicious applications are not taken into consideration. |
| 5 | Attire: Conveying information exposure through avatar apparel (2013) | Developed Attire, an app for computers and smartphones that presents the user with an avatar. | Attire: mobile app | Attire conveys real-time data exposure in a lightweight and unobtrusive manner via updates to the avatar's clothing. | The system can be modified to handle other sorts of data with location, such as views of photographs, 'mentions' and 'retweets' of tweets, comments on status messages, etc. |
| 6 | DDSGA: A data-driven semi-global alignment approach for detecting masquerade attacks | Presents the Data-Driven Semi-Global Alignment (DDSGA) method. From the security effectiveness viewpoint, DDSGA improves the scoring system by adopting distinct alignment parameters for every user. | DDSGA approach | DDSGA improves both the hit ratio and the false positive rate with an acceptable computational overhead. | Needs to detect masquerade attacks in a cloud environment by improving this CIDS framework. |
| 7 | Gyrus: A framework for user-intent monitoring of text-based networked applications | Develops a way to break this cycle by ensuring that a system's behavior matches the user's intent. | Gyrus framework | Gyrus successfully stops modern malware. | The present design can be adapted to operate in a cloud computing model in which the remote host is an instance in an IaaS cloud. Gyrus must be implemented on other platforms. |
| 8 | Beyond pattern matching: A concurrency model for stateful deep packet inspection | Presents a novel domain-specific concurrency model that solves this issue by introducing the notion of detection scope: a unit for partitioning network traffic such that the traffic in every resulting "slice" is independent for detection purposes. | Intrusion detection systems (IDSs) | The technique correctly partitions existing sequential IDS analyses while preserving accuracy, and exploits the network traffic's inherent concurrency potential for throughput improvements. | It is a complex process. |
| 9 | Rapid and parallel content screening for detecting transformed data exposure | Developed two novel algorithms for finding long as well as inexact information leaks. | Sequence comparison technique | The technique can be used for big data analytics as well as for finding transformed sensitive information exposure. | Time-consuming process. |
| 10 | On measuring the similarity of network hosts: Pitfalls, new metrics, and empirical analyses | Proposed a behavioral distance metric for network hosts and compared its performance with a metric that neglects such data. | Dynamic Time Warping (DTW) | The method gives consistent and useful characterizations of host behavior compared with the L1 metric. | The method does not allow for localized reordering of points. |
| 11 | Distributed Searchable Asymmetric Encryption | Discusses the concept of preventing leakage of the user's search criteria (the search pattern). | Polyfunctional searchable encryption | Achieves higher efficiency and security in information retrieval. | Complex search queries and the leaking of search increase execution time. |
| 12 | KAMAN Protocol for Preventing Virtual Side Channel Attacks in Cloud Environment | Provides a solution for the cloud environment to avoid the leakage of user information. | Describes the virtual side channel attack and how this attack provides the user's credentials to access user information. | The KAMAN (Kerberos assistant mobile ad-hoc network) protocol is used to avoid leaks of user information. | The work is presented in a controlled environment; how effective the solution is in a real cloud environment is not discussed. |
| 13 | Wireless Sensor Network for Radiation Detection | When the wireless sensor network (WSN) sends or receives data from other nodes, it reports radiation leaks in addition to the data. | Locates the leaking station and controls the station's power consumption by sending a special command to it from the server node. | A GSM network is used to avoid radiation leaks and data leaks. | Effect of environmental parameters while transmitting data in the wireless sensor network. |

3. RESEARCH METHOD

An organization can secure its data in four ways:
Identify Critical Data: the organization identifies its sensitive data and uses the latest algorithms or
software to protect it. It classifies data based on organization policies, educates users, and provides
sensitive data to employees through an access control mechanism.
Monitor Network Access: the system monitors all network traffic and generates real-time network access
reports. It finds anomalous behavior patterns, automatically notifies the network administrator of them, and
blocks access for the suspicious user.
Adaptive Security Mechanism: the security mechanism adapts to changes in the attacks that cause data leaks.
Apply Security or Encryption Techniques: an encryption algorithm is applied to critical sensitive information.


4. PROPOSED SYSTEM 
The architecture and flow of the proposed system are shown in Figure 1.

[Figure 1 flow: Input Dataset -> Fetch Sensitive Data -> Subsequence-Preserving Sampling Algorithm ->
Sample List -> Recurrence Relation -> Dynamic Algorithm -> Sensitivity Score Matrix -> Data Leakage Detection]

Figure 1. The architecture and flow of the proposed system


In the proposed system, the user first browses the input dataset and fetches the sensitive data from
it. A subsequence-preserving sampling algorithm is then used to generate the sample list.

If a string p is a substring of q (denoted p ⊑ q), then p′ is also a substring of q′ (p′ ⊑ q′), where
p′ is the sampled sensitive data of p and q′ is the sampled sensitive data of q.
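
A minimal sketch of a deterministic, context-aware sampling rule is given below. It is an assumption made for illustration, not necessarily the selection function used in the proposed system: every sliding window keeps its smallest item (by character code, leftmost on ties), and the names `sample`, `is_subsequence`, and the window size are hypothetical. This simple rule guarantees the weaker subsequence relation (the sample of a substring p is a subsequence of the sample of q), because windows that straddle the boundary of p may select extra items in q.

```python
def sample(seq: str, window: int = 4) -> str:
    """Keep, for every sliding window, its smallest item (leftmost on ties)."""
    selected = set()
    for start in range(len(seq) - window + 1):
        win = seq[start:start + window]
        # min() returns the first minimal index, so ties break to the left.
        offset = min(range(window), key=lambda i: win[i])
        selected.add(start + offset)
    # Emit the selected items in their original order.
    return "".join(seq[i] for i in sorted(selected))


def is_subsequence(small: str, big: str) -> bool:
    """True if `small` can be obtained from `big` by deleting items."""
    it = iter(big)
    return all(ch in it for ch in small)


# Hypothetical data: p is a substring of q.
q = "account 4111-1111-2222-3333 balance"
p = q[8:27]
print(is_subsequence(sample(p), sample(q)))
```

The selection depends only on the content of each window, so the same input always yields the same sample, and a window inside p selects the same item as the matching window inside q.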

After that, the sensitivity score matrix is calculated by a recurrence-relation-based dynamic
programming algorithm. The system then compares the sensitivity score with a threshold value: if the
sensitivity score is greater than the threshold, a data leak is detected; otherwise, no leak is reported.
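
The scoring and thresholding step can be sketched as follows. The paper does not spell out its recurrence, so this sketch substitutes a standard Smith-Waterman-style local alignment recurrence as a stand-in; the scoring weights (`match`, `mismatch`, `gap`), the threshold, and the function names are all assumptions.

```python
def sensitivity_score(sensitive: str, content: str,
                      match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between two sampled sequences."""
    rows, cols = len(sensitive) + 1, len(content) + 1
    # score[i][j] = best score of an alignment ending at sensitive[i-1], content[j-1]
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if sensitive[i - 1] == content[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + step,  # align the two items
                              score[i - 1][j] + gap,       # skip an item of `sensitive`
                              score[i][j - 1] + gap)       # skip an item of `content`
            best = max(best, score[i][j])
    return best


def is_leak(sensitive: str, content: str, threshold: int) -> bool:
    """Report a leak when the score is greater than the threshold."""
    return sensitivity_score(sensitive, content) > threshold


print(sensitivity_score("abc", "xxabcxx"))  # exact copy embedded in content
print(sensitivity_score("abc", "abxc"))     # copy with one inserted item
print(sensitivity_score("abc", "cba"))      # same items, different order
```

With these assumed weights, an inserted item lowers the score only slightly, while reordering the same items lowers it sharply, which is the order-aware behavior that distinguishes alignment from set intersection.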


5. RESULTS AND DISCUSSION 

The sampling algorithm is based on context-aware selection. The decision to include an item in the
sample list depends on a selection function, that is, on how the item compares with its surrounding data.
The sampling algorithm is deterministic and preserves subsequences.

The recurrence relation in the dynamic programming algorithm works on the compact sample list. The
algorithm is alignment-based: it detects data leaks using order-aware comparison and provides high tolerance
to pattern variations. It also supports partial leak detection with a high detection rate.

The system is implemented in C++, and the Enron dataset [14] is used, which contains the email data of
150 users with full headers and bodies. The data leak detection rate and the false positive rate are
computed using Equations (1) and (2):

$$\text{Detection rate} = \frac{TP}{TP + FN} \tag{1}$$

$$\text{False positive rate} = \frac{FP}{FP + TN} \tag{2}$$

where TP is the number of true positives, FP the number of false positives, FN the number of false
negatives, and TN the number of true negatives. Table 2 shows the data leak and no data leak cases in terms
of these quantities.


Table 2. Data leak in Terms of TP, FP and no Data leak in Terms of FN, TN 
| | True Leak | No Leak |
|---|---|---|
| Data Leak Detected | TP | FP |
| No Data Leak Detected | FN | TN |
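
The two rates can be computed from the four cells of Table 2 as in the sketch below. It takes the detection rate over the True Leak column, TP / (TP + FN), and uses the standard definition of the false positive rate over the No Leak column, FP / (FP + TN). The function names and the counts in the example are hypothetical and are not results from the experiments.

```python
def detection_rate(tp: int, fn: int) -> float:
    """Fraction of true leaks that were detected: TP / (TP + FN)."""
    total = tp + fn
    return tp / total if total else 0.0


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of non-leaks that were flagged as leaks: FP / (FP + TN)."""
    total = fp + tn
    return fp / total if total else 0.0


# Hypothetical confusion-matrix counts, for illustration only.
tp, fn, fp, tn = 90, 10, 5, 95
print(detection_rate(tp, fn))
print(false_positive_rate(fp, tn))
```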


6. CONCLUSION 

In the real world, organizations face the problem of data leakage. Leaked data may turn up on other
laptops or websites. From this survey we conclude that a data leakage detection system is highly useful for
protecting the data of various industries against illegal use. There is therefore a need to develop a
content inspection method that detects leaks of important data in the content of files or network traffic.
The proposed system is also useful for detecting modification of data. In future, such systems will be
necessary to detect leaks of personal data, financial transactions, online shopping, social media, and so on.